Portuguese Bank Marketing Strategy - TPOT Tutorial

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
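
If you do not have Data_FinalProject.csv locally, the same data is available from the UCI repository linked above as bank-additional-full.csv. A minimal loading sketch, assuming that file layout (the UCI copy is semicolon-separated):

import pandas as pd

# The UCI copy of this data set uses ';' as the field separator
Marketing = pd.read_csv('bank-additional-full.csv', sep=';')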


In [1]:
# Import required libraries
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np



In [2]:
#Load the data
Marketing = pd.read_csv('Data_FinalProject.csv')
Marketing.head(5)


Out[2]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

5 rows × 21 columns

Data Exploration


In [3]:
Marketing.groupby('loan').y.value_counts()


Out[3]:
loan     y  
no       no     30100
         yes     3850
unknown  no       883
         yes      107
yes      no      5565
         yes      683
Name: y, dtype: int64
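
Raw counts can be hard to compare across groups of different sizes. As a quick sketch (not run in this session), the counts above can be turned into within-group subscription rates:

# Share of 'yes'/'no' within each loan group
Marketing.groupby('loan').y.apply(lambda s: s.value_counts(normalize=True))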

In [4]:
Marketing.groupby(['loan','marital']).y.value_counts()


Out[4]:
loan     marital   y  
no       divorced  no      3420
                   yes      396
         married   no     18469
                   yes     2098
         single    no      8155
                   yes     1345
         unknown   no        56
                   yes       11
unknown  divorced  no       113
                   yes        8
         married   no       528
                   yes       60
         single    no       241
                   yes       39
         unknown   no         1
yes      divorced  no       603
                   yes       72
         married   no      3399
                   yes      374
         single    no      1552
                   yes      236
         unknown   no        11
                   yes        1
Name: y, dtype: int64

Data Munging

The first and most important step in using TPOT on any data set is to rename the target/response variable to 'class'.


In [5]:
Marketing.rename(columns={'y': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 11 categorical variables that contain non-numerical values: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, and class.


In [6]:
Marketing.dtypes


Out[6]:
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
class              object
dtype: object
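
As a cross-check, the categorical columns can also be listed programmatically; a one-liner sketch:

# Columns stored with the generic 'object' dtype are the non-numerical ones here
Marketing.select_dtypes(include=['object']).columns.tolist()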

We then check the number of levels that each of these categorical variables has.


In [7]:
for cat in ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'class']:
    print("Number of levels in category '{0}': {1}".format(cat, Marketing[cat].unique().size))


Number of levels in category 'job': 12
Number of levels in category 'marital': 4
Number of levels in category 'education': 8
Number of levels in category 'default': 3
Number of levels in category 'housing': 3
Number of levels in category 'loan': 3
Number of levels in category 'contact': 2
Number of levels in category 'month': 10
Number of levels in category 'day_of_week': 5
Number of levels in category 'poutcome': 3
Number of levels in category 'class': 2

As we can see, several of these variables (contact, poutcome, class, marital, default, housing, and loan) have only a few levels. Let's find out what they are.


In [8]:
for cat in ['contact', 'poutcome', 'class', 'marital', 'default', 'housing', 'loan']:
    print("Levels for category '{0}': {1}".format(cat, Marketing[cat].unique()))


Levels for category 'contact': ['telephone' 'cellular']
Levels for category 'poutcome': ['nonexistent' 'failure' 'success']
Levels for category 'class': ['no' 'yes']
Levels for category 'marital': ['married' 'single' 'divorced' 'unknown']
Levels for category 'default': ['no' 'unknown' 'yes']
Levels for category 'housing': ['no' 'yes' 'unknown']
Levels for category 'loan': ['no' 'yes' 'unknown']

We then encode these levels manually as numerical values. Any missing values (NaN) are simply replaced with a placeholder value (-999); in fact, we perform this replacement for the entire data set.


In [9]:
Marketing['marital'] = Marketing['marital'].map({'married':0,'single':1,'divorced':2,'unknown':3})
Marketing['default'] = Marketing['default'].map({'no':0,'yes':1,'unknown':2})
Marketing['housing'] = Marketing['housing'].map({'no':0,'yes':1,'unknown':2})
Marketing['loan'] = Marketing['loan'].map({'no':0,'yes':1,'unknown':2})
Marketing['contact'] = Marketing['contact'].map({'telephone':0,'cellular':1})
Marketing['poutcome'] = Marketing['poutcome'].map({'nonexistent':0,'failure':1,'success':2})
Marketing['class'] = Marketing['class'].map({'no':0,'yes':1})
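
The repeated .map() calls above can also be written as a single loop over a dictionary of per-column mappings; an equivalent sketch:

# Level -> code mapping for each manually encoded column
level_maps = {
    'marital':  {'married': 0, 'single': 1, 'divorced': 2, 'unknown': 3},
    'default':  {'no': 0, 'yes': 1, 'unknown': 2},
    'housing':  {'no': 0, 'yes': 1, 'unknown': 2},
    'loan':     {'no': 0, 'yes': 1, 'unknown': 2},
    'contact':  {'telephone': 0, 'cellular': 1},
    'poutcome': {'nonexistent': 0, 'failure': 1, 'success': 2},
    'class':    {'no': 0, 'yes': 1},
}
for col, mapping in level_maps.items():
    Marketing[col] = Marketing[col].map(mapping)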

In [10]:
Marketing = Marketing.fillna(-999)
pd.isnull(Marketing).any()


Out[10]:
age               False
job               False
marital           False
education         False
default           False
housing           False
loan              False
contact           False
month             False
day_of_week       False
duration          False
campaign          False
pdays             False
previous          False
poutcome          False
emp.var.rate      False
cons.price.idx    False
cons.conf.idx     False
euribor3m         False
nr.employed       False
class             False
dtype: bool

For the remaining categorical variables (job, education, month, and day_of_week), we one-hot encode the levels using scikit-learn's MultiLabelBinarizer and treat each resulting binary column as a new feature.


In [11]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

job_Trans = mlb.fit_transform([{str(val)} for val in Marketing['job'].values])
education_Trans = mlb.fit_transform([{str(val)} for val in Marketing['education'].values])
month_Trans = mlb.fit_transform([{str(val)} for val in Marketing['month'].values])
day_of_week_Trans = mlb.fit_transform([{str(val)} for val in Marketing['day_of_week'].values])

In [12]:
day_of_week_Trans


Out[12]:
array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       ..., 
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]])
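
Since each of these columns holds exactly one label per row, pandas' get_dummies would produce an equivalent one-hot encoding in a single call; a sketch of that alternative:

# One-hot encode the four remaining categorical columns in one shot
dummies = pd.get_dummies(Marketing[['job', 'education', 'month', 'day_of_week']])
dummies.shape  # expect 12 + 8 + 10 + 5 = 35 columns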

Drop the original categorical columns (now encoded separately) and the class label from the feature set.


In [13]:
marketing_new = Marketing.drop(['marital','default','housing','loan','contact','poutcome','class','job','education','month','day_of_week'], axis=1)

In [14]:
assert (len(Marketing['day_of_week'].unique()) == len(mlb.classes_)), "Not Equal"  # sanity check: every level received its own column

In [15]:
Marketing['day_of_week'].unique(),mlb.classes_


Out[15]:
(array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object),
 array(['fri', 'mon', 'thu', 'tue', 'wed'], dtype=object))

We then add the one-hot encoded features to form the final dataset to be used with TPOT. Note from the output above that the binarized columns follow the sorted order of mlb.classes_, not the order in which the levels first appear in the data.


In [16]:
marketing_new = np.hstack((marketing_new.values, job_Trans, education_Trans, month_Trans, day_of_week_Trans))

In [17]:
np.isnan(marketing_new).any()


Out[17]:
False

Keeping in mind that the final dataset is now a NumPy array, we can check the number of features in it as follows.


In [18]:
marketing_new[0].size


Out[18]:
45
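
This count checks out: 10 numeric columns remain after dropping 11 of the original 21, and the one-hot encodings contribute 12 (job) + 8 (education) + 10 (month) + 5 (day_of_week) = 35 columns, for 10 + 35 = 45 features in total.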

Finally, we store the class labels, which we need to predict, in a separate variable.


In [19]:
marketing_class = Marketing['class'].values

Data Analysis using TPOT

To begin our analysis, we divide our data into training and validation sets. The validation set just gives us an idea of the test-set error. Model selection and tuning are entirely taken care of by TPOT, so if we want to, we can skip creating this validation set.


In [20]:
training_indices, validation_indices = train_test_split(Marketing.index, stratify=marketing_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size


Out[20]:
(30891, 10297)

After that, we proceed to call the fit(), score(), and export() functions on our training dataset. An important TPOT parameter is the number of generations (the generations kwarg). Since our aim is simply to illustrate the use of TPOT, we keep the default setting of 100 generations and instead bound the total running time via the max_time_mins kwarg (which, once reached, effectively overrides the generation count). We also cap the time allowed for evaluating any single pipeline via max_eval_time_mins.

On a standard laptop with 4 GB of RAM, each generation takes approximately 5 minutes to run. Thus, at the default of 100 generations without the explicit time bound, the total run time could be roughly 8 hours.


In [21]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=2, max_eval_time_mins=0.04, population_size=15)
tpot.fit(marketing_new[training_indices], marketing_class[training_indices])


Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Optimization Progress: 49pipeline [00:46,  1.10s/pipeline]                  
Generation 1 - Current best internal CV score: 0.913728927925
Optimization Progress: 71pipeline [01:11,  1.06s/pipeline]
Generation 2 - Current best internal CV score: 0.913728927925
Optimization Progress: 95pipeline [01:38,  1.28s/pipeline]
Generation 3 - Current best internal CV score: 0.913728927925
Optimization Progress: 115pipeline [01:58,  1.04s/pipeline]
Generation 4 - Current best internal CV score: 0.913728927925
                                                           
2.00407131667 minutes have elapsed. TPOT will close down.
TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: DecisionTreeClassifier(input_matrix, criterion=gini, max_depth=5, min_samples_leaf=16, min_samples_split=8)
Out[21]:
TPOTClassifier(config_dict={'sklearn.ensemble.GradientBoostingClassifier': {'max_features': array([ 0.05,  0.1 ,  0.15,  0.2 ,  0.25,  0.3 ,  0.35,  0.4 ,  0.45,
        0.5 ,  0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ]), 'learning_rate': [0.001, 0.01, 0.1, 0.5, 1.0], 'min_samples_... 0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ])}, 'sklearn.preprocessing.RobustScaler': {}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=1000000, max_eval_time_mins=0.04,
        max_time_mins=2, mutation_rate=0.9, n_jobs=1, offspring_size=15,
        periodic_checkpoint_folder=None, population_size=15,
        random_state=None, scoring=None, subsample=1.0, verbosity=2,
        warm_start=False)

In the above, 4 generations were computed before the time limit was reached, each reporting the best internal cross-validation score found so far. The best pipeline achieves an internal CV score of 91.373%, obtained by fitting a decision tree classifier to the data set. Next, the score on the held-out validation set is computed.
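
As an aside, the winning pipeline can also be inspected directly via TPOT's fitted_pipeline_ attribute, which holds the fitted scikit-learn Pipeline; a quick sketch:

# The best pipeline found, as a regular scikit-learn object
print(tpot.fitted_pipeline_)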


In [22]:
tpot.score(marketing_new[validation_indices], Marketing.loc[validation_indices, 'class'].values)


Out[22]:
0.91628629697970287

In [23]:
tpot.export('tpot_marketing_pipeline.py')


Out[23]:
True

In [ ]:
# %load tpot_marketing_pipeline.py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=42)

# Score on the training set was:0.913728927925
exported_pipeline = DecisionTreeClassifier(criterion="gini", max_depth=5, min_samples_leaf=16, min_samples_split=8)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
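
As a sanity check, one could append an accuracy computation to the exported script; a minimal sketch using scikit-learn's accuracy_score (the variable names are those defined in the exported file above):

from sklearn.metrics import accuracy_score

# Compare the exported pipeline's predictions against the held-out targets
print(accuracy_score(testing_target, results))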
